We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state
space. The proposed algorithm combines state aggregation with the use of upper confidence bounds
for implementing optimism in the face of uncertainty. Besides the existence of an optimal policy
which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards
and transition probabilities.

1 Introduction 

Real-world problems usually demand continuous state or action spaces, and one of the challenges for
reinforcement learning is to deal with such continuous domains. In many problems there is a natural
metric on the state space such that close states exhibit similar behavior. Often such similarities can
be formalized as Lipschitz or, more generally, Hölder continuity of reward and transition functions.

The simplest continuous reinforcement learning problem is the 1-dimensional continuum-armed
bandit, where the learner has to choose arms from a bounded interval. Bounds on the regret with
respect to an optimal policy under the assumption that the reward function is Hölder continuous have
been given in [15, 4]. The proposed algorithms apply the UCB algorithm [2] to a discretization of the
problem. That way, the regret suffered by the algorithm consists of the loss by aggregation (which
can be bounded using Hölder continuity) plus the regret the algorithm incurs in the discretized
setting. More recently, algorithms that adapt the used discretization (making it finer in more
promising regions) have been proposed and analyzed [16, 8].

While the continuous bandit case has been investigated in detail, in the general case of continuous
state Markov decision processes (MDPs) a lot of work is confined to rather particular settings,
primarily with respect to the considered transition model. In the simplest case, the transition function
is considered to be deterministic as in [19], and mistake bounds for the respective discounted
setting have been derived in [6]. Another common assumption is that transition functions are linear
functions of state and action plus some noise. For such settings sample complexity bounds have
been given in [23, 7], while $\tilde{O}(\sqrt{T})$ bounds for the regret after $T$ steps are shown in [1]. However,
there is also some research considering more general transition dynamics under the assumption that
close states behave similarly, as will be considered here. While most of this work is purely
experimental [12, 24], there are also some contributions with theoretical guarantees. Thus, [13] considers
PAC-learning for continuous reinforcement learning in metric state spaces, when generative
sampling is possible. The proposed algorithm is a generalization of the $E^3$ algorithm [14] to continuous
domains. A respective adaptive discretization approach is suggested in [20]. The PAC-like bounds
derived there, however, depend on the (random) behavior of the proposed algorithm.

Here we suggest a learning algorithm for undiscounted reinforcement learning in continuous state
space. The proposed algorithm is in the tradition of algorithms like UCRL2 [11] in that it implements
the "optimism in the face of uncertainty" maxim, here combined with state aggregation. Thus, the
algorithm does not need a generative model or access to "resets": learning is done online, that is, in
a single continual session of interactions between the environment and the learning policy.

For our algorithm we derive regret bounds of $\tilde{O}(T^{(2+\alpha)/(2+2\alpha)})$ for MDPs with 1-dimensional
state space and Hölder-continuous rewards and transition probabilities with parameter $\alpha$. These
bounds also straightforwardly generalize to dimension $d$, where the regret is bounded by
$\tilde{O}(T^{(2d+\alpha)/(2d+2\alpha)})$. Thus, in particular, if rewards and transition probabilities are Lipschitz, the
regret is bounded by $\tilde{O}(T^{(2d+1)/(2d+2)})$ in dimension $d$ and $\tilde{O}(T^{3/4})$ in dimension 1. We also
present an accompanying lower bound of $\Omega(\sqrt{T})$. As far as we know, these are the first regret
bounds for a general undiscounted continuous reinforcement learning setting.
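
To make the dependence of the rate on the dimension and the Hölder exponent concrete, the following small sketch evaluates the exponent of $T$ in the upper bound stated above. The function name `regret_exponent` is a hypothetical helper introduced only for this illustration, and exact fractions are used so that the Lipschitz special cases can be read off directly.

```python
from fractions import Fraction


def regret_exponent(d: int, alpha: Fraction) -> Fraction:
    """Exponent of T in the stated upper bound: (2d + alpha) / (2d + 2*alpha).

    d is the dimension of the state space, alpha > 0 the Hoelder exponent.
    (Hypothetical helper for illustration; not part of the algorithm.)
    """
    if d < 1:
        raise ValueError("dimension d must be at least 1")
    if alpha <= 0:
        raise ValueError("Hoelder exponent alpha must be positive")
    return (2 * d + alpha) / (2 * d + 2 * alpha)


# Lipschitz case (alpha = 1) in dimension 1 and 2.
lipschitz_1d = regret_exponent(1, Fraction(1))
lipschitz_2d = regret_exponent(2, Fraction(1))
```

Under these assumptions the exponent stays below 1 for every $\alpha > 0$ and moves toward 1 as $d$ grows, which is the sense in which the bound is sublinear in every dimension.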

2 Preliminaries 

We consider the following setting. Given is a Markov decision process (MDP) $M$ with state space
$\mathcal{S} = [0,1]^d$ and finite action space $\mathcal{A}$. For the sake of simplicity, in the following we assume $d = 1$.
However, proofs and results generalize straightforwardly to arbitrary dimension, cf. Remark 5 below.
The random rewards in state $s$ under action $a$ are assumed to be bounded in $[0,1]$ with mean $r(s,a)$.
The transition probability distribution in state $s$ under action $a$ is denoted by $p(\cdot|s,a)$.

We will make the natural assumption that rewards and transition probabilities are similar in close
states. More precisely, we assume that rewards and transition probabilities are Hölder continuous.

Assumption 1. There are $L, \alpha > 0$ such that for any two states $s, s'$ and all actions $a$,

$$|r(s,a) - r(s',a)| \le L\,|s - s'|^{\alpha}.$$

Assumption 2. There are $L, \alpha > 0$ such that for any two states $s, s'$ and all actions $a$,

$$\|p(\cdot|s,a) - p(\cdot|s',a)\|_1 \le L\,|s - s'|^{\alpha}.$$

For the sake of simplicity we will assume that $\alpha$ and $L$ in Assumptions 1 and 2 are the same.
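
As an illustration of Assumption 1, the sketch below checks the Hölder inequality numerically on a finite grid. The reward function $r(s,a) = \sqrt{s}$ for a fixed action is a hypothetical example chosen here, not one taken from the paper, and a grid check can only give evidence, not a proof. The helper name `holder_violation` is likewise an assumption of this example.

```python
import math


def holder_violation(f, L, alpha, points):
    """Largest value of |f(s) - f(s2)| - L * |s - s2|**alpha over all pairs in `points`.

    A value of (numerically) zero or less means no pair on the grid
    violates the Hoelder condition with parameters L and alpha.
    """
    worst = -math.inf
    for s in points:
        for s2 in points:
            gap = abs(f(s) - f(s2)) - L * abs(s - s2) ** alpha
            worst = max(worst, gap)
    return worst


# Hypothetical mean reward for one fixed action: r(s) = sqrt(s) on [0, 1].
grid = [i / 200 for i in range(201)]
violation_half = holder_violation(math.sqrt, 1.0, 0.5, grid)  # alpha = 1/2
violation_lip = holder_violation(math.sqrt, 1.0, 1.0, grid)   # Lipschitz, L = 1
```

On this grid the square root satisfies the condition with $L = 1$, $\alpha = 1/2$ up to rounding, while it violates the Lipschitz condition with $L = 1$ near $s = 0$, which is the kind of function the more general Hölder assumption is meant to cover.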

We also assume existence of an optimal policy $\pi^* : \mathcal{S} \to \mathcal{A}$ which gives optimal average reward
$\rho^* = \rho^*(M)$ on $M$ independent of the initial state. A sufficient condition for state-independent
optimal reward is geometric convergence of $\pi^*$ to an invariant probability measure. This is a natural
condition which e.g. holds for any communicating finite state MDP. It also ensures (cf. Chapter 10
of [10]) that the Poisson equation holds for the optimal policy. In general, under suitable technical
conditions (like geometric convergence to an invariant probability measure $\mu_\pi$) the Poisson equation

$$\rho_\pi + \lambda_\pi(s) = r(s,\pi(s)) + \int_{\mathcal{S}} p(ds'|s,\pi(s)) \cdot \lambda_\pi(s') \qquad (1)$$

relates the rewards and transition probabilities under any measurable policy $\pi$ to its average
reward $\rho_\pi$ and the bias function $\lambda_\pi : \mathcal{S} \to \mathbb{R}$ of $\pi$. Intuitively, the bias is the difference in accumulated
rewards when starting in a different state. Formally, the bias is defined by the Poisson equation (1)
and the normalizing equation $\int_{\mathcal{S}} \lambda_\pi \, d\mu_\pi = 0$ (cf. e.g. [9]). The following result follows from the
bias definition and Assumptions 1 and 2 (together with results from Chapter 10 of [10]).

Proposition 3. Under Assumptions 1 and 2, the bias of the optimal policy is bounded.

Consequently, it makes sense to define the bias span $H(M)$ of a continuous state MDP $M$ satisfying
Assumptions 1 and 2 to be $H(M) := \sup_s \lambda_{\pi^*}(s) - \inf_s \lambda_{\pi^*}(s)$. Note that since $\inf_s \lambda_{\pi^*}(s) \le 0$
by definition of the bias, the bias function $\lambda_{\pi^*}$ is upper bounded by $H(M)$.

We are interested in algorithms which can compete with the optimal policy $\pi^*$ and measure their
performance by the regret (after $T$ steps) defined as $T\rho^*(M) - \sum_{t=1}^{T} r_t$, where $r_t$ is the random
reward obtained by the algorithm at step $t$. Indeed, within $T$ steps no canonical or even bias
optimal policy (cf. Chapter 10 of [10]) can obtain higher accumulated reward than $T\rho^* + H(M)$.
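
The regret definition translates directly into code. The following minimal sketch assumes that the optimal average reward $\rho^*$ is known, which is only the case in simulations; the function name `regret` is chosen here for illustration.

```python
from typing import Sequence


def regret(rho_star: float, rewards: Sequence[float]) -> float:
    """Regret after T = len(rewards) steps: T * rho_star minus the collected reward."""
    return len(rewards) * rho_star - sum(rewards)


# Four steps against a (hypothetical) optimal average reward of 0.75.
example_regret = regret(0.75, [1.0, 0.0, 1.0, 0.0])
```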

3 Algorithm 

Our algorithm UCCRL, shown in detail as Algorithm 1, implements the "optimism in the face of
uncertainty" maxim just like UCRL2 [11] or REGAL [5]. It maintains a set of plausible MDPs $\mathcal{M}$ and
chooses optimistically an MDP $\tilde{M} \in \mathcal{M}$ and a policy $\tilde{\pi}$ such that the average reward $\rho_{\tilde{\pi}}(\tilde{M})$ is
maximized, cf. (4). Whereas for UCRL2 and REGAL the set of plausible MDPs is defined by confidence
intervals for rewards and transition probabilities for each individual state-action pair, for UCCRL
we assume an MDP to be plausible if its aggregated rewards and transition probabilities are within
a certain range. This range is defined by the aggregation error (determined by the assumed Hölder
continuity) and respective confidence intervals, cf. (2), (3). Correspondingly, the estimates for
rewards and transition probabilities for some state-action pair $(s,a)$ are calculated from all sampled
values of action $a$ in states close to $s$.

Algorithm 1 The UCCRL algorithm

Input: State space $\mathcal{S} = [0,1]$, action space $\mathcal{A}$, confidence parameter $\delta > 0$, aggregation
parameter $n \in \mathbb{N}$, upper bound $H$ on the bias span, Hölder parameters $L$, $\alpha$.

Initialization:
  ▷ Let $I_1 := [0, \frac{1}{n}]$, $I_j := (\frac{j-1}{n}, \frac{j}{n}]$ for $j = 2, 3, \ldots, n$.
  ▷ Set $t := 1$, and observe the initial state $s_1$ and interval $I(s_1)$.

for episodes $k = 1, 2, \ldots$ do
  ▷ Let $N_k(I_j, a)$ be the number of times action $a$ has been chosen in a state $\in I_j$ prior to
    episode $k$, and $v_k(I_j, a)$ the respective counts in episode $k$.

  Initialize episode $k$:
  ▷ Set the start time of episode $k$, $t_k := t$.
  ▷ Compute estimates $\hat{r}_k(s,a)$ and $\hat{p}_k^{\mathrm{agg}}(I_i|s,a)$ for rewards and transition probabilities, using
    all samples from states in the same interval $I(s)$, respectively.

  Compute policy $\tilde{\pi}_k$:
  ▷ Let $\mathcal{M}_k$ be the set of plausible MDPs $\tilde{M}$ with $H(\tilde{M}) \le H$ and rewards $\tilde{r}(s,a)$ and
    transition probabilities $\tilde{p}(\cdot|s,a)$ satisfying

    $$|\tilde{r}(s,a) - \hat{r}_k(s,a)| \le Ln^{-\alpha} + \sqrt{\frac{7\log(2nAt_k/\delta)}{2\max\{1, N_k(I(s),a)\}}}, \qquad (2)$$

    $$\big\|\tilde{p}^{\mathrm{agg}}(\cdot|s,a) - \hat{p}_k^{\mathrm{agg}}(\cdot|s,a)\big\|_1 \le Ln^{-\alpha} + \sqrt{\frac{56\,n\log(2At_k/\delta)}{\max\{1, N_k(I(s),a)\}}}. \qquad (3)$$

  ▷ Choose policy $\tilde{\pi}_k$ and $\tilde{M}_k \in \mathcal{M}_k$ such that

    $$\rho_{\tilde{\pi}_k}(\tilde{M}_k) = \arg\max\{\rho^*(M) \mid M \in \mathcal{M}_k\}. \qquad (4)$$

  Execute policy $\tilde{\pi}_k$:
  while $v_k(I(s_t), \tilde{\pi}_k(s_t)) < \max\{1, N_k(I(s_t), \tilde{\pi}_k(s_t))\}$ do
    ▷ Choose action $a_t = \tilde{\pi}_k(s_t)$, obtain reward $r_t$, and observe next state $s_{t+1}$.
    ▷ Set $t := t + 1$.
  end while
end for

More precisely, for the aggregation UCCRL partitions the state space into intervals $I_1 := [0, \frac{1}{n}]$,
$I_j := (\frac{j-1}{n}, \frac{j}{n}]$ for $j = 2, 3, \ldots, n$. The corresponding aggregated transition probabilities are
defined by

$$p^{\mathrm{agg}}(I_j|s,a) := \int_{I_j} p(ds'|s,a). \qquad (5)$$

Generally, for a (transition) probability distribution $p(\cdot)$ over $\mathcal{S}$ we write $p^{\mathrm{agg}}(\cdot)$ for the
aggregated probability distribution with respect to $\{I_1, I_2, \ldots, I_n\}$. Now, given the aggregated state space
$\{I_1, I_2, \ldots, I_n\}$, estimates $\hat{r}(s,a)$ and $\hat{p}^{\mathrm{agg}}(\cdot|s,a)$ are calculated from all samples of action $a$ in
states in $I(s)$, the interval $I_j$ containing $s$. (Consequently, the estimates are the same for states in
the same interval.)

Like UCRL2 and REGAL, UCCRL proceeds in episodes in which the chosen policy remains fixed.
Episodes are terminated when the number of times an action has been sampled from some interval $I_j$
has been doubled. Only then are estimates updated and a new policy calculated.

Since all states in the same interval $I_j$ have the same confidence intervals, finding the optimal
pair $\tilde{M}_k$, $\tilde{\pi}_k$ in (4) is equivalent to finding the respective optimistic discretized MDP $\tilde{M}_k^{\mathrm{agg}}$ and
an optimal policy $\tilde{\pi}_k^{\mathrm{agg}}$ on $\tilde{M}_k^{\mathrm{agg}}$. Then $\tilde{\pi}_k$ can be set to be the extension of $\tilde{\pi}_k^{\mathrm{agg}}$ to $\mathcal{S}$, that is,
$\tilde{\pi}_k(s) := \tilde{\pi}_k^{\mathrm{agg}}(I(s))$ for all $s$. However, due to the constraint on the bias, even in this finite case
efficient computation of $\tilde{M}_k^{\mathrm{agg}}$ and $\tilde{\pi}_k^{\mathrm{agg}}$ is still an open problem. We note that the REGAL.C
algorithm [5] selects optimistic MDP and optimal policy in the same way as UCCRL.

While the algorithm presented here is the first modification of UCRL2 to continuous reinforcement
learning problems, there are similar adaptations to online aggregation [21] and learning in finite state
MDPs with some additional similarity structure known to the learner [22].
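
The bookkeeping parts of UCCRL (state aggregation, the confidence radii on the right-hand sides of (2) and (3), and the episode stopping rule) are easy to sketch in code. The following is only an illustrative sketch: all function names are hypothetical, and the optimistic planning step (4) is deliberately left out, since its efficient computation is described above as an open problem.

```python
import math


def interval_index(s: float, n: int) -> int:
    """1-based index j of the interval I_j containing s.

    Uses I_1 = [0, 1/n] and I_j = ((j-1)/n, j/n] for j = 2, ..., n.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("state must lie in [0, 1]")
    return max(1, math.ceil(s * n))


def reward_radius(L, alpha, n, A, t_k, delta, N):
    """Aggregation error plus confidence width for rewards, as in (2)."""
    width = math.sqrt(7 * math.log(2 * n * A * t_k / delta) / (2 * max(1, N)))
    return L * n ** (-alpha) + width


def transition_radius(L, alpha, n, A, t_k, delta, N):
    """Aggregation error plus confidence width for transitions, as in (3)."""
    width = math.sqrt(56 * n * math.log(2 * A * t_k / delta) / max(1, N))
    return L * n ** (-alpha) + width


def episode_continues(v_k: int, N_k: int) -> bool:
    """Loop condition of the 'Execute policy' step.

    v_k: visits of the current (interval, action) pair within the episode,
    N_k: visits of that pair before the episode started.
    """
    return v_k < max(1, N_k)
```

Note that both radii keep the additive term $Ln^{-\alpha}$ no matter how many samples are collected; only the square-root part shrinks with the count $N$.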

4 Regret Bounds 

For UCCRL we can derive the following bounds on the regret.

Theorem 4. Let $M$ be an MDP with continuous state space $[0,1]$, $A$ actions, rewards and
transition probabilities satisfying Assumptions 1 and 2, and bias span upper bounded by $H$. Then with
probability $1 - \delta$, the regret of UCCRL (run with input parameters $n$ and $H$) after $T$ steps is upper
bounded by

$$\mathrm{const} \cdot nH\sqrt{AT\log\left(\tfrac{T}{\delta}\right)} + \mathrm{const}' \cdot HLn^{-\alpha}T. \qquad (6)$$

Therefore, setting $n = T^{1/(2+2\alpha)}$ gives regret upper bounded by

$$\mathrm{const} \cdot HL\sqrt{A\log\left(\tfrac{T}{\delta}\right)} \cdot T^{(2+\alpha)/(2+2\alpha)}.$$

With no known upper bound on the bias span, guessing $H$ by $\log T$ one still obtains an upper bound
on the regret of $\tilde{O}(T^{(2+\alpha)/(2+2\alpha)})$.

Intuitively, the second term in the regret bound of (6) is the discretization error, while the first term
corresponds to the regret on the discretized MDP. A detailed proof of Theorem 4 can be found in
Section 5 below.
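
The choice of $n$ in Theorem 4 can be checked by balancing the two terms of (6). The short calculation below is a simplified sketch that drops constants, the factors $H$, $L$, $A$ and the logarithmic term, so it only tracks the powers of $T$.

```latex
\begin{align*}
n\sqrt{T} = n^{-\alpha}T
  &\iff n^{1+\alpha} = T^{1/2}
  \iff n = T^{1/(2+2\alpha)},\\
n\sqrt{T}\,\Big|_{n = T^{1/(2+2\alpha)}}
  &= T^{\frac{1}{2+2\alpha} + \frac{1}{2}}
   = T^{\frac{2+\alpha}{2+2\alpha}},\\
n^{-\alpha}T\,\Big|_{n = T^{1/(2+2\alpha)}}
  &= T^{1 - \frac{\alpha}{2+2\alpha}}
   = T^{\frac{2+\alpha}{2+2\alpha}}.
\end{align*}
```

For $\alpha = 1$ this gives $n = T^{1/4}$ and both terms of order $T^{3/4}$, matching the Lipschitz rate in dimension 1.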

Remark 5 ($d$-dimensional case). The general $d$-dimensional case can be handled as described for
dimension 1, with the only difference being that the discretization now has $n^d$ states, so that one
has $n^d$ instead of $n$ in the first term of (6). Then choosing $n = T^{1/(2d+2\alpha)}$ bounds the regret by
$\tilde{O}(T^{(2d+\alpha)/(2d+2\alpha)})$.

Remark 6 (unknown horizon). If the horizon $T$ is unknown, then the doubling trick (executing the
algorithm in rounds $i = 1, 2, \ldots$, guessing $T = 2^i$ and setting the confidence parameter to $\delta/2^i$)
gives the same bounds.
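
The schedule used by the doubling trick can be written down as a small generator. This is a sketch of the round structure only; the name `doubling_schedule` and the stopping rule based on a total step budget are assumptions made for this example, and the call that would run UCCRL in each round is omitted.

```python
def doubling_schedule(delta: float, total_steps: int):
    """Yield (round i, guessed horizon 2**i, confidence parameter delta / 2**i).

    Rounds are produced until their horizons cover `total_steps` steps.
    """
    i, covered = 1, 0
    while covered < total_steps:
        horizon = 2 ** i
        yield i, horizon, delta / horizon
        covered += horizon
        i += 1


rounds = list(doubling_schedule(0.1, 100))
```

Because the confidence parameters form a geometric series, their sum over all rounds stays below $\delta$, so a union bound over the rounds keeps the overall failure probability at most $\delta$.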

Remark 7 (unknown Hölder parameters). The UCCRL algorithm receives (bounds on) the
Hölder parameters $L$ and $\alpha$ as inputs. If these parameters are not known, then one can still obtain
sublinear regret bounds, albeit with worse dependence on $T$. Specifically, we can use the
model-selection technique introduced in [17]. To do this, fix a certain number $J$ of values for the constants
$L$ and $\alpha$; each of these values will be considered as a model. The model selection consists in running
UCCRL with each of these parameter values for a certain period of $\tau_0$ time steps (exploration). Then
one selects the model with the highest reward and uses it for a period of $\tau'_0$ time steps (exploitation),
while checking that its average reward stays within (6) of what was obtained in the exploitation
phase. If the average reward does not pass this test, then the model with the second-best average
reward is selected, and so on. Then one switches to exploration with longer periods $\tau_1$, etc. Since
there are no guarantees on the behavior of UCCRL when the Hölder parameters are wrong, none
of the models can be discarded at any stage. Optimizing over the parameters $\tau_i$ and $\tau'_i$ as done
in [17], and increasing the number $J$ of considered parameter values, one can obtain regret bounds
of $\tilde{O}(T^{(2+2\alpha)/(2+3\alpha)})$, or $\tilde{O}(T^{4/5})$ in the Lipschitz case. For details see [17]. Since in this
model-selection process UCCRL is used in a "black-box" fashion, the exploration is rather wasteful, and
thus we think that this bound is suboptimal. Recently, the results of [17] have been improved [18],
and it seems that similar analysis gives improved regret bounds for the case of unknown Hölder
parameters as well.

The following is a complementing lower bound on the regret for continuous state reinforcement
learning.

Theorem 8. For any $A, H > 1$ and any reinforcement learning algorithm there is a continuous
state reinforcement learning problem with $A$ actions and bias span $H$ satisfying Assumption 1 such
that the algorithm suffers regret of $\Omega(\sqrt{HAT})$.



Proof. Consider the following reinforcement learning problem with state space $[0,1]$. The state
space is partitioned into $n$ intervals $I_j$ of equal size. The transition probabilities for each action $a$
are on each of the intervals $I_j$ concentrated and equally distributed on the same interval $I_j$. The
rewards on each interval $I_j$ are also constant for each $a$ and are chosen as in the lower bounds for a
multi-armed bandit problem [3] with $nA$ arms. That is, giving only one arm slightly higher reward,
it is known [3] that regret of $\Omega(\sqrt{nAT})$ can be forced upon any algorithm on the respective bandit
problem. Adding another action giving no reward and equally distributing over the whole state
space, the bias span of the problem is $n$ and the regret $\Omega(\sqrt{HAT})$. □

Remark 9. Note that Assumption 2 does not hold in the example used in the proof of Theorem 8.
However, the transition probabilities are piecewise constant (and hence Lipschitz) and known to
the learner. Actually, it is straightforward to deal with piecewise Hölder continuous rewards and
transition probabilities where the finitely many points of discontinuity are known to the learner. If
one makes sure that the intervals of the discretized state space do not contain any discontinuities, it
is easy to adapt UCCRL and Theorem 4 accordingly.

Remark 10 (comparison to bandits). The bounds of Theorems 4 and 8 cannot be directly
compared to bounds for the continuous-armed bandit problem [15, 4, 16, 8], because the latter is not a
special case of learning MDPs with continuous state space (and rather corresponds to a continuous
action space). Thus, in particular one cannot freely sample an arbitrary state of the state space as
assumed in continuous-armed bandits.

5 Proof of Theorem 4

For the proof of the main theorem we adapt the proof of the regret bounds for finite MDPs in [11]
and [5]. Although the state space is now continuous, due to the finite horizon $T$, we can reuse
some arguments, so that we keep the structure of the original proof of Theorem 2 in [11]. Some of
the necessary adaptations made are similar to techniques used for showing regret bounds for other
modifications of the original UCRL2 algorithm [21, 22], which however only considered finite-state
MDPs.

5.1 Splitting into Episodes 

Let $v_k(s,a)$ be the number of times action $a$ has been chosen in episode $k$ when being in state $s$, and
denote the total number of episodes by $m$. Then setting $\Delta_k := \sum_{s,a} v_k(s,a)(\rho^* - r(s,a))$, with
probability at least $1 - \frac{\delta}{12T^{5/4}}$ the regret of UCCRL after $T$ steps is upper bounded by (cf. Section
4.1 of [11])

$$\sqrt{\tfrac{5}{8}\,T\log\left(\tfrac{8T}{\delta}\right)} + \sum_{k=1}^{m}\Delta_k. \qquad (7)$$



5.2 Failing Confidence Intervals 

Next, we consider the regret incurred when the true MDP $M$ is not contained in the set of
plausible MDPs $\mathcal{M}_k$. Thus, fix a state-action pair $(s,a)$, and recall that $\hat{r}(s,a)$ and $\hat{p}^{\mathrm{agg}}(\cdot|s,a)$ are the
estimates for rewards and transition probabilities calculated from all samples of state-action pairs
contained in the same interval $I(s)$. Now assume that at step $t$ there have been $N > 0$ samples of
action $a$ in states in $I(s)$ and that in the $i$-th sample a transition from state $s_i \in I(s)$ to state $s'_i$ has
been observed ($i = 1, \ldots, N$).

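First, concerning the rewards one obtains as in the proof of Lemma 17 in Appendix C.1 of [11], but
now using Hoeffding for independent and not necessarily identically distributed random variables,
that

$$\Pr\left\{ \big|\hat{r}(s,a) - E[\hat{r}(s,a)]\big| \ge \sqrt{\tfrac{7}{2N}\log\left(\tfrac{2nAt}{\delta}\right)} \right\} \le \frac{\delta}{60nAt^7}. \qquad (8)$$

Concerning the transition probabilities, we have for a suitable $x \in \{-1,1\}^n$

$$\begin{aligned}
\big\|\hat{p}^{\mathrm{agg}}(\cdot|s,a) - E[\hat{p}^{\mathrm{agg}}(\cdot|s,a)]\big\|_1
 &= \sum_{j=1}^{n} \big|\hat{p}^{\mathrm{agg}}(I_j|s,a) - E[\hat{p}^{\mathrm{agg}}(I_j|s,a)]\big| \\
 &= \sum_{j=1}^{n} \big(\hat{p}^{\mathrm{agg}}(I_j|s,a) - E[\hat{p}^{\mathrm{agg}}(I_j|s,a)]\big)\,x(I_j) \\
 &= \frac{1}{N}\sum_{i=1}^{N} \Big( x(I(s'_i)) - \int_{\mathcal{S}} p(ds'|s_i,a)\cdot x(I(s')) \Big).
\end{aligned} \qquad (9)$$

For any $x \in \{-1,1\}^n$, $X_i := x(I(s'_i)) - \int_{\mathcal{S}} p(ds'|s_i,a)\cdot x(I(s'))$ is a martingale difference sequence
with $|X_i| \le 2$, so that by Azuma-Hoeffding inequality (e.g., Lemma 10 in [11]),
$\Pr\{\sum_{i=1}^{N} X_i \ge \theta\} \le \exp(-\theta^2/8N)$ and in particular

$$\Pr\left\{ \sum_{i=1}^{N} X_i \ge \sqrt{56\,nN\log\left(\tfrac{2At}{\delta}\right)} \right\} \le \left(\frac{\delta}{2At}\right)^{7n} < \frac{\delta}{2^n\,20nAt^7}.$$

A union bound over all sequences $x \in \{-1,1\}^n$ then yields from (9) that

$$\Pr\left\{ \big\|\hat{p}^{\mathrm{agg}}(\cdot|s,a) - E[\hat{p}^{\mathrm{agg}}(\cdot|s,a)]\big\|_1 \ge \sqrt{\tfrac{56n}{N}\log\left(\tfrac{2At}{\delta}\right)} \right\} \le \frac{\delta}{20nAt^7}. \qquad (10)$$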

Another union bound over all $t$ possible values for $N$, all $n$ intervals and all actions shows that the
confidence intervals in (8) and (10) hold at time $t$ with probability at least $1 - \frac{\delta}{15t^6}$ for the actual
counts $N(I(s),a)$ and all state-action pairs $(s,a)$. (Note that the equations (8) and (10) are the same
for state-action pairs with states in the same interval.)
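
The failure probability in this union bound can be verified by a one-line calculation, taking the failure probabilities of (8) and (10) to be $\frac{\delta}{60nAt^7}$ and $\frac{\delta}{20nAt^7}$ respectively and multiplying by the $t \cdot n \cdot A$ combinations of count, interval and action:

```latex
\[
t \cdot n \cdot A \cdot \left( \frac{\delta}{60nAt^7} + \frac{\delta}{20nAt^7} \right)
= \frac{\delta}{t^6}\left( \frac{1}{60} + \frac{3}{60} \right)
= \frac{\delta}{15t^6}.
\]
```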

Now, by linearity of expectation $E[\hat{r}(s,a)]$ can be written as $\frac{1}{N}\sum_{i=1}^{N} r(s_i,a)$. Since the $s_i$ are
assumed to be in the same interval $I(s)$, it follows that $|E[\hat{r}(s,a)] - r(s,a)| \le Ln^{-\alpha}$. Similarly,
$\|E[\hat{p}^{\mathrm{agg}}(\cdot|s,a)] - p^{\mathrm{agg}}(\cdot|s,a)\|_1 \le Ln^{-\alpha}$. Together with (8) and (10) this shows that with
probability at least $1 - \frac{\delta}{15t^6}$ for all state-action pairs $(s,a)$

$$|\hat{r}(s,a) - r(s,a)| \le Ln^{-\alpha} + \sqrt{\frac{7\log(2nAt/\delta)}{2\max\{1, N(I(s),a)\}}}, \qquad (11)$$

$$\big\|\hat{p}^{\mathrm{agg}}(\cdot|s,a) - p^{\mathrm{agg}}(\cdot|s,a)\big\|_1 \le Ln^{-\alpha} + \sqrt{\frac{56\,n\log(2At/\delta)}{\max\{1, N(I(s),a)\}}}. \qquad (12)$$

This shows that the true MDP is contained in the set of plausible MDPs $\mathcal{M}(t)$ at step $t$ with
probability at least $1 - \frac{\delta}{15t^6}$, just as in Lemma 17 of [11]. The argument that

$$\sum_{k=1}^{m} \Delta_k \mathbb{1}_{M \notin \mathcal{M}_k} \le \sqrt{T} \qquad (13)$$

with probability at least $1 - \frac{\delta}{12T^{5/4}}$ then can be taken without any changes from Section 4.2 of [11].

5.3 Regret in Episodes with $M \in \mathcal{M}_k$

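Now for episodes with $M \in \mathcal{M}_k$, by the optimistic choice of $\tilde{M}_k$ and $\tilde{\pi}_k$ in (4) we can bound

$$\begin{aligned}
\Delta_k &= \sum_s v_k(s,\tilde{\pi}_k(s))\big(\rho^* - r(s,\tilde{\pi}_k(s))\big) \\
 &\le \sum_s v_k(s,\tilde{\pi}_k(s))\big(\tilde{\rho}^*_k - r(s,\tilde{\pi}_k(s))\big) \\
 &= \sum_s v_k(s,\tilde{\pi}_k(s))\big(\tilde{\rho}^*_k - \tilde{r}_k(s,\tilde{\pi}_k(s))\big) + \sum_s v_k(s,\tilde{\pi}_k(s))\big(\tilde{r}_k(s,\tilde{\pi}_k(s)) - r(s,\tilde{\pi}_k(s))\big).
\end{aligned}$$

Any term $\tilde{r}_k(s,a) - r(s,a) \le |\tilde{r}_k(s,a) - \hat{r}_k(s,a)| + |\hat{r}_k(s,a) - r(s,a)|$ is bounded according to
(2) and (11), as we assume that $\tilde{M}_k, M \in \mathcal{M}_k$, so that summarizing states in the same interval $I_j$

$$\Delta_k \le \sum_s v_k(s,\tilde{\pi}_k(s))\big(\tilde{\rho}^*_k - \tilde{r}_k(s,\tilde{\pi}_k(s))\big) + 2\sum_{j=1}^{n}\sum_{a \in \mathcal{A}} v_k(I_j,a)\left( Ln^{-\alpha} + \sqrt{\frac{7\log(2nAt_k/\delta)}{2\max\{1, N_k(I_j,a)\}}} \right).$$

Since $\max\{1, N_k(I_j,a)\} \le t_k \le T$, setting $\tau_k := t_{k+1} - t_k$ to be the length of episode $k$ we have

$$\Delta_k \le \sum_s v_k(s,\tilde{\pi}_k(s))\big(\tilde{\rho}^*_k - \tilde{r}_k(s,\tilde{\pi}_k(s))\big) + 2Ln^{-\alpha}\tau_k + \sqrt{14\log\left(\tfrac{2nAT}{\delta}\right)}\sum_{j=1}^{n}\sum_{a \in \mathcal{A}} \frac{v_k(I_j,a)}{\sqrt{\max\{1, N_k(I_j,a)\}}}. \qquad (14)$$

We continue analyzing the first term on the right hand side of (14). By the Poisson equation (1) for
$\tilde{\pi}_k$ on $\tilde{M}_k$, denoting the respective bias by $\tilde{\lambda}_k := \tilde{\lambda}_{\tilde{\pi}_k}$ we can write

$$\begin{aligned}
\sum_s v_k(s,\tilde{\pi}_k(s))\big(\tilde{\rho}^*_k - \tilde{r}_k(s,\tilde{\pi}_k(s))\big)
 &= \sum_s v_k(s,\tilde{\pi}_k(s))\Big( \int_{\mathcal{S}} \tilde{p}_k(ds'|s,\tilde{\pi}_k(s)) \cdot \tilde{\lambda}_k(s') - \tilde{\lambda}_k(s) \Big) \\
 &= \sum_s v_k(s,\tilde{\pi}_k(s))\Big( \int_{\mathcal{S}} p(ds'|s,\tilde{\pi}_k(s)) \cdot \tilde{\lambda}_k(s') - \tilde{\lambda}_k(s) \Big) \qquad (15) \\
 &\quad + \sum_s v_k(s,\tilde{\pi}_k(s)) \sum_{j=1}^{n} \int_{I_j} \big( \tilde{p}_k(ds'|s,\tilde{\pi}_k(s)) - p(ds'|s,\tilde{\pi}_k(s)) \big) \cdot \tilde{\lambda}_k(s'). \qquad (16)
\end{aligned}$$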

5.4 The True Transition Functions 



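Now $\|\tilde{p}_k^{\mathrm{agg}}(\cdot|s,a) - p^{\mathrm{agg}}(\cdot|s,a)\|_1 \le \|\tilde{p}_k^{\mathrm{agg}}(\cdot|s,a) - \hat{p}_k^{\mathrm{agg}}(\cdot|s,a)\|_1 + \|\hat{p}_k^{\mathrm{agg}}(\cdot|s,a) - p^{\mathrm{agg}}(\cdot|s,a)\|_1$
can be bounded by (3) and (12), because we assume $\tilde{M}_k, M \in \mathcal{M}_k$. Hence, since by definition of
the algorithm $H$ bounds the bias function $\tilde{\lambda}_k$, the term in (16) is bounded by

$$\begin{aligned}
\sum_s v_k(s,\tilde{\pi}_k(s)) &\sum_{j=1}^{n} \int_{I_j} \tilde{\lambda}_k(s')\,\big( \tilde{p}_k(ds'|s,\tilde{\pi}_k(s)) - p(ds'|s,\tilde{\pi}_k(s)) \big) \\
 &\le \sum_s v_k(s,\tilde{\pi}_k(s)) \cdot H \cdot 2\left( Ln^{-\alpha} + \sqrt{\frac{56\,n\log(2AT/\delta)}{\max\{1, N_k(I(s),\tilde{\pi}_k(s))\}}} \right) \\
 &= 2HLn^{-\alpha}\tau_k + 4H\sqrt{14\,n\log\left(\tfrac{2AT}{\delta}\right)}\sum_{j=1}^{n}\sum_{a \in \mathcal{A}} \frac{v_k(I_j,a)}{\sqrt{\max\{1, N_k(I_j,a)\}}},
\end{aligned} \qquad (17)$$

while for the term in (15)

$$\begin{aligned}
\sum_s v_k(s,\tilde{\pi}_k(s)) &\Big( \int_{\mathcal{S}} p(ds'|s,\tilde{\pi}_k(s)) \cdot \tilde{\lambda}_k(s') - \tilde{\lambda}_k(s) \Big) \\
 &= \sum_{t=t_k}^{t_{k+1}-1} \Big( \int_{\mathcal{S}} p(ds'|s_t,a_t) \cdot \tilde{\lambda}_k(s') - \tilde{\lambda}_k(s_t) \Big) \\
 &= \sum_{t=t_k}^{t_{k+1}-1} \Big( \int_{\mathcal{S}} p(ds'|s_t,a_t) \cdot \tilde{\lambda}_k(s') - \tilde{\lambda}_k(s_{t+1}) \Big) + \tilde{\lambda}_k(s_{t_{k+1}}) - \tilde{\lambda}_k(s_{t_k}).
\end{aligned}$$

Let $k(t)$ be the index of the episode time step $t$ belongs to. Then the sequence
$X_t := \int_{\mathcal{S}} p(ds'|s_t,a_t) \cdot \tilde{\lambda}_{k(t)}(s') - \tilde{\lambda}_{k(t)}(s_{t+1})$ is a sequence of martingale differences, so that the
Azuma-Hoeffding inequality shows (cf. Section 4.3.2 and in particular eq. (18) in [11]) that after summing
over all episodes we have

$$\sum_{k=1}^{m} \left( \sum_{t=t_k}^{t_{k+1}-1} \Big( \int_{\mathcal{S}} p(ds'|s_t,a_t) \cdot \tilde{\lambda}_k(s') - \tilde{\lambda}_k(s_{t+1}) \Big) + \tilde{\lambda}_k(s_{t_{k+1}}) - \tilde{\lambda}_k(s_{t_k}) \right) \le H\sqrt{\tfrac{5}{2}\,T\log\left(\tfrac{8T}{\delta}\right)} + HnA\log_2\left(\tfrac{8T}{nA}\right), \qquad (18)$$

where the second term comes from an upper bound on the number of episodes, which can be derived
analogously to Appendix C.2 of [11].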



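5.5 Summing over Episodes with $M \in \mathcal{M}_k$

To conclude, we sum (14) over all the episodes with $M \in \mathcal{M}_k$, using (15), (17), and (18). This
yields that with probability at least $1 - \frac{\delta}{12T^{5/4}}$

$$\begin{aligned}
\sum_{k=1}^{m} \Delta_k \mathbb{1}_{M \in \mathcal{M}_k}
 &\le H\sqrt{\tfrac{5}{2}\,T\log\left(\tfrac{8T}{\delta}\right)} + HnA\log_2\left(\tfrac{8T}{nA}\right) + 2HLn^{-\alpha}T \\
 &\quad + 4H\sqrt{14\,n\log\left(\tfrac{2AT}{\delta}\right)}\sum_{k=1}^{m}\sum_{j=1}^{n}\sum_{a \in \mathcal{A}} \frac{v_k(I_j,a)}{\sqrt{\max\{1, N_k(I_j,a)\}}} \\
 &\quad + 2Ln^{-\alpha}T + \sqrt{14\log\left(\tfrac{2nAT}{\delta}\right)}\sum_{k=1}^{m}\sum_{j=1}^{n}\sum_{a \in \mathcal{A}} \frac{v_k(I_j,a)}{\sqrt{\max\{1, N_k(I_j,a)\}}}.
\end{aligned} \qquad (19)$$

Analogously to Section 4.3.3 and Appendix C.3 of [11], one can show that

$$\sum_{j=1}^{n}\sum_{a \in \mathcal{A}}\sum_{k=1}^{m} \frac{v_k(I_j,a)}{\sqrt{\max\{1, N_k(I_j,a)\}}} \le (\sqrt{2}+1)\sqrt{nAT},$$

and we get from (19) after some simplifications that with probability $\ge 1 - \frac{\delta}{12T^{5/4}}$

$$\begin{aligned}
\sum_{k=1}^{m} \Delta_k \mathbb{1}_{M \in \mathcal{M}_k}
 &\le H\sqrt{\tfrac{5}{2}\,T\log\left(\tfrac{8T}{\delta}\right)} + HnA\log_2\left(\tfrac{8T}{nA}\right) \\
 &\quad + (4H+1)\sqrt{14\,n\log\left(\tfrac{2AT}{\delta}\right)}\,(\sqrt{2}+1)\sqrt{nAT} + 2(H+1)Ln^{-\alpha}T.
\end{aligned} \qquad (20)$$

Finally, evaluating (7) by summing over all episodes, by (13) and (20) we have with probability
$\ge 1 - \frac{\delta}{4T^{5/4}}$ an upper bound on the regret of

$$\begin{aligned}
\sqrt{\tfrac{5}{8}\,T\log\left(\tfrac{8T}{\delta}\right)} &+ \sum_{k=1}^{m} \Delta_k \mathbb{1}_{M \notin \mathcal{M}_k} + \sum_{k=1}^{m} \Delta_k \mathbb{1}_{M \in \mathcal{M}_k} \\
 &\le \sqrt{\tfrac{5}{8}\,T\log\left(\tfrac{8T}{\delta}\right)} + \sqrt{T} + H\sqrt{\tfrac{5}{2}\,T\log\left(\tfrac{8T}{\delta}\right)} + HnA\log_2\left(\tfrac{8T}{nA}\right) \\
 &\quad + (4H+1)\sqrt{14\,n\log\left(\tfrac{2AT}{\delta}\right)}\,(\sqrt{2}+1)\sqrt{nAT} + 2(H+1)Ln^{-\alpha}T.
\end{aligned}$$

A union bound over all possible values of $T$ and further simplifications as in Appendix C.4 of [11]
finish the proof. □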



6 Outlook 



We think that a generalization of our results to continuous action space should not pose any major 
problems. In order to improve over the given bounds, it may be promising to investigate more 
sophisticated discretization patterns. 

The assumption of Hölder continuity is an obvious, yet not the only possible assumption one can
make about the transition probabilities and reward functions. A more general problem is to assume
a set $\mathcal{F}$ of functions, find a way to measure the "size" of $\mathcal{F}$, and derive regret bounds depending on
this size of $\mathcal{F}$.